-
Notifications
You must be signed in to change notification settings - Fork 170
fix: Always accept {Expr,Series}.is_in(other: Iterable)
#3207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Using 4/16 constructors is enough to demonstrate the bug
Expr.is_in(other: Iterable) Expr.is_in(other: Iterable)
Preferring to use thrid-party conversions whenever possible
Expr.is_in(other: Iterable) {Expr,Series}.is_in(other: Iterable)
| class _CanTo_List(Protocol): # noqa: N801 | ||
| def to_list(self, *args: Any, **kwds: Any) -> list[Any]: ... | ||
|
|
||
|
|
||
| class _CanToList(Protocol): | ||
| def tolist(self, *args: Any, **kwds: Any) -> list[Any]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I hate this as well π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a better idea would be to ...
Rename _CanTo_List -> ToList, and move to _translate.py alongside:
narwhals/narwhals/_translate.py
Lines 124 to 125 in f05011c
| class ToDict(Protocol[ToDictDT_co]): | |
| def to_dict(self, *args: Any, **kwds: Any) -> ToDictDT_co: ... |
Move these as well, but rename to reflect they their naming originates from numpy and pyarrow (respectively):
Lines 2131 to 2132 in 0c66432
| class _CanToList(Protocol): | |
| def tolist(self, *args: Any, **kwds: Any) -> list[Any]: ... |
Lines 2135 to 2136 in 0c66432
| class _CanTo_PyList(Protocol): # noqa: N801 | |
| def to_pylist(self, *args: Any, **kwds: Any) -> list[Any]: ... |
The names of the guards can still stay the same, since their implementations will (after updating the protocol names) the link between origin, protocol, method name:
Lines 2139 to 2144 in 0c66432
| def can_to_list(obj: Any) -> TypeIs[_CanTo_List]: | |
| return ( | |
| is_narwhals_series(obj) | |
| or is_polars_series(obj) | |
| or _hasattr_static(obj, "to_list") | |
| ) |
Lines 2147 to 2148 in 0c66432
| def can_tolist(obj: Any) -> TypeIs[_CanToList]: | |
| return is_numpy_array_1d(obj) or _hasattr_static(obj, "tolist") |
Lines 2151 to 2154 in 0c66432
| def can_to_pylist(obj: Any) -> TypeIs[_CanTo_PyList]: | |
| return ( | |
| (pa := get_pyarrow()) and isinstance(obj, (pa.Array, pa.ChunkedArray)) | |
| ) or _hasattr_static(obj, "to_pylist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated my comment (#3207 (comment)) - I would argue that native series are not ok, while numpy 1d arrays are.
My thought process for this is that if someone is doing something along the lines of:
import narwhals as nw
import polars as pl
def agnostic_func(frame: IntoDataFrameT) -> IntoDataFrameT:
other = pl.Series([1, 2, 3]) # <- notice how this is a native series!!!
return nw.from_native(frame).filter(nw.col("x").is_in(other)).to_native()then the function is clearly not agnostic and polars would be required in this case.
A different case would be if a narwhals series with a different backend is provided. This could mean that the function is agnostic but a user is "mixin" backends:
def is_left_in_right(left_series: IntoSeriesT, right_series: IntoSeriesT) -> IntoSeriesT:
left_nw = nw.from_native(left_series, series_only=True)
right_nw = nw.from_native(right_series, series_only=True)
return left_nw.is_in(right_nw).to_native()
# but now it a user to mix it up, not the library itself
is_left_in_right(pl.Series([1,2,3]), pd.Series([0, 1]))This is the case I suggested to yell at the user with a warning.
From our side, I think it would greatly simplify (read as, get rid of) most of the protocols here, the type guards as well as iterable_to_sequence function.
| _into_iter: Callable[[int], Iterator[IntoIterable]] = _into_iter_selector() | ||
| """`into_iter` fixtures use the suffix `_<n>` to denote the maximum number of constructors. | ||
| Anything greater than **10** may return less depending on available dependencies. | ||
| """ | ||
|
|
||
|
|
||
| @pytest.fixture(params=_into_iter(16), scope="session", ids=_ids_into_iter) | ||
| def into_iter_16(request: pytest.FixtureRequest) -> IntoIterable: | ||
| function: IntoIterable = request.param | ||
| return function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda outdated now since (633a06d)
narwhals/expr.py
Outdated
| if isinstance(other, Iterable) and not isinstance(other, (str, bytes)): | ||
| other = other.to_native() if is_series(other) else iterable_to_sequence(other) | ||
| return self._with_elementwise( | ||
| lambda plx: self._to_compliant_expr(plx).is_in( | ||
| to_native(other, pass_through=True) | ||
| ) | ||
| lambda plx: self._to_compliant_expr(plx).is_in(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this but it is still wrong.
Didn't notice the issue until I started trying to add typing to the compliant-level
narwhals/narwhals/_compliant/column.py
Line 173 in 849326f
| def is_in(self, other: Any) -> Self: ... |
I don't understand why we were allowing any kind of Native* to be passed to every backend? π€
Gonna add another test for at least the more common case of a Series from the wrong backend
Updated
test: Add test_expr_is_in_series_wrong_backend
I'm not sure of a safe way to keep the same behavior (if it is desired).
Since we don't know the backend at this stage, the options I see are:
- Unconditionally convert to
list | tuple - Raise elsewhere when a
Seriesis passed to a lazy backend - Disallow
Expr.is_in(other: Series) - Do nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli @FBruzzesi
do you guys have any preference on a path forward here?
This case is a bit different to some of the other places we'd reject nw.Series for lazy backends - since the length isn't an issue.
But the safer option of converting all nw.Series is gonna be less performant than the currently unsafe version (which only works on a matching eager implementation)
Everything seems like a tradeoff to me π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved out of draft for visibility (#3207 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dangotbanned I will try to take a look during the weekend, but I am not sure I can manage.
My thoughts from the context in the messages here (I didn't open the code changes just yet):
- Realistically I don't think it's too common to pass one series from a different backend but we should not exclude such possibility
- My opinion would be to:
- Use the native series if
isinstance(other, Series) and expr._implementation == other._implementation) - Otherwise convert it to a list (we can do that with native methods thankfully), but warn that such conversion is happening with a
UserWarning. This case includes a series passed to a lazy backend
- Use the native series if
Update: Regarding native series, then I would prefer to raise an exception in such case
{Expr,Series}.is_in(other: Iterable) {Expr,Series}.is_in(other: Iterable)
Will close #3195
What type of PR is this? (check all applicable)
Related issues
duckdb>=1.4.1typing & warningsΒ #3189 (comment)Series.from_iterableΒ #2933 (comment)Checklist
If you have comments or can explain your changes, please do so below
I'm expecting most of the diff to be from this commit:
Series.from_iterableconstructorsThe fixing per-backend should only be a couple lines or handling it in
nw.Expriterable_to_sequenceExpr.is_in(other: Series)